Abstract

Background: Both scientists and the public routinely refer to randomized controlled trials (RCTs) as the 'gold standard' of scientific evidence. Although there is no question that placebo-controlled RCTs play a significant role in the evaluation of new pharmaceutical treatments, especially when it is important to rule out placebo effects, they have many inherent limitations that constrain their ability to inform medical decision making. The purpose of this paper is to raise questions about over-reliance on RCTs and to point out an additional perspective for evaluating healthcare evidence, as embodied in the Hill criteria. The arguments presented here are generally relevant to all areas of health care, though mental health applications provide the primary context for this essay.

Discussion: This article first traces the history of RCTs and then evaluates five of their major limitations: they often lack external validity, they have the potential to increase health risks in the general population, they are no less likely to overestimate treatment effects than many other methods, they make a relatively weak contribution to clinical practice, and they are excessively expensive (leading to several additional vulnerabilities in the quality of evidence produced). Next, the nine Hill criteria are presented and discussed as a richer approach to the evaluation of health care treatments. Reliance on these multi-faceted criteria requires more analytical thinking than simply examining RCT data, but will also enhance confidence in the evaluation of novel treatments.

Summary: Excessive reliance on RCTs tends to stifle funding of other types of research and publication of other forms of evidence. We call upon our research and clinical colleagues to consider additional methods of evaluating data, such as the Hill criteria. Over-reliance on RCTs is similar to resting all of health care evidence on a one-legged stool.



Background 

The fact that so many pathological syndromes are named after the individual who first characterized the disorder illustrates that medicine has always valued good clinical observations. In fact, one could argue that most major discoveries in health have evolved from observations first documented in case reports and case series. Randomized controlled trials (RCTs) are often mounted later to validate an intervention, especially in comparison with a placebo, but, as is well recognised, they are rarely major sources of scientific discovery. Although observational data hold a time-honored place in medicine, 21st-century methodology has pre-empted several millennia of historical tradition by anointing RCTs with the descriptive phrase 'gold standard' of evidence [1].

There are many reasons why RCTs have become the de facto standard by which all forms of evidence are evaluated. No other study design rivals the RCT's ability to eliminate selection bias and to reduce the risk of a serious imbalance in known and unknown factors that could influence outcomes (when the randomization procedure is executed properly). Our concern is that evidence-based medicine has made a leap from considering RCTs a high standard to considering them the only standard. The primary purpose of this paper is to question the over-valuation of RCTs that Rosenbaum referred to as a form of tyranny [2]. Our premise is that over-reliance on RCTs has resulted in a foundation for decision-making in health care that is as unstable as a one-legged stool. The history and limitations of RCTs are first summarized, followed by suggestions for an alternative approach to the evaluation of evidence. The second purpose of this paper is to propose greater use of alternative forms of evidence. Drawing on a variety of sources leads to more stable and less error-prone conclusions by providing other legs to stabilize the stool upon which medical decision-making can rest.

History of RCTs 

The use of randomized controlled trials predates the actual term RCT by several centuries. One of the earliest documented applications to health was the proof that citrus prevented scurvy [3]. Scurvy routinely killed more than 50% of sailors on long voyages, no small impediment to world exploration in the 14th to 18th centuries. A dietary factor was suspected in 1601, when Captain James Lancaster administered a tablespoon of lemon juice per day to each sailor on one ship: none of those on the ship with lemon juice rations died, while 40% of those on the three ships without lemon juice were dead halfway through the journey to India. Replications and extensions followed. In the mid-18th century, for example, a ship's physician named James Lind conducted a study in which early signs of scurvy were effectively treated in those randomly assigned to receive citrus, thereby showing the ability of citrus to reverse the disease in its early stages [3].

It was Sir Austin Bradford Hill (sometimes referred to as Bradford Hill, or Bradford-Hill), a British statistician and epidemiologist, who promoted the use of randomization in clinical trials testing health care interventions, a position he took prior to World War II [4]. The issue became more prominent in 1946, when the British Medical Research Council was investigating the effect of streptomycin on tuberculosis. The extreme shortage of streptomycin in England caused considerable stress amongst physicians, who were constrained to use existing therapies despite reports of a promising new intervention. The consensus at the time was that the "small supply of streptomycin allocated to it for research purposes would be best employed in a rigorously planned investigation with concurrent controls" (p. 769) [5].

In 1962, RCTs were still quite rare, yet they were the norm by 1992, when the Evidence-Based Medicine Working Group published its seminal paper [6]. The rise in the importance of RCTs over this 30-year period was influenced worldwide by significant decisions made in the United States regarding premarket approval of drugs by the Food and Drug Administration (FDA). Section 355(d) of the 1962 Drug Amendments to the American Food, Drug, and Cosmetic Act changed procedural requirements in the United States [7]. With this clause, FDA for the first time required what it referred to as 'effectiveness' for approval, in addition to the previous requirement of safety, a change that led directly to incorporating randomization and blinding into studies [7]. According to Kulynych's historical review, public safety was the basis of the requirement for effectiveness: if an ineffective drug replaces one of proven value, people can be harmed. Thus, even though the primary mandate had previously been the safety of drugs, the rationale that led in 1962 to section 355(d) evolved from recognition that effectiveness is part of demonstrating safety. Litigation that followed the Drug Amendments clarified that RCTs would be required as proof of efficacy, and hence of meeting the safety requirement. The new statute also required that pharmaceutical companies provide "substantial evidence" consisting of "adequate and well-controlled investigations, by experts qualified by scientific training" to demonstrate the effectiveness of a new drug. Subsequent legal interpretation clarified that two RCTs would be required to demonstrate that effectiveness [7].

The new effectiveness criterion had a significant impact on standards of scientific evidence for the next 30 years; however, by the end of the 20th century, even FDA was beginning to question the need for multiple Phase III RCTs as proof of effectiveness to justify market approval [7]. Criticisms were based primarily on cost and inefficiency: clinical trials for market approval cost millions of dollars. In the FDA Modernization Act of 1997, the requirement of two RCTs was softened to one, but RCTs continued to be the gold standard for market approval of drugs.

Since the era of Hill's methodological contributions, various groups have promoted the idea of levels of evidence, but recently some have questioned the position of RCTs in a hierarchical model. For instance, Ghaemi commented that "...the key feature of levels of evidence to keep in mind is that each level has its own strengths and weaknesses, and as a result, no single level is completely useful or useless" (p. 10) [8]. Further, Walach and colleagues emphasized that the RCT hierarchy of evidence is based on the pharmacological model of treatment and is not always appropriate for the evaluation of other types of intervention [9]. They argued for a Circular Model based on many methodologies and designs, which might be considered a return to the historical principle of depending on the 'weight of the evidence.' The Circular Model holds that experimental methods (such as RCTs) used to evaluate efficacy need to be complemented by other methods that take into account real-life issues and clinical applicability. As Walach and colleagues conclude, "Rather than postulating a single 'best method' this view [the Circular Model] acknowledges that there are optimal methods for answering specific questions, and that a composite of all methods constitutes best scientific evidence." Some areas of science refer to this as a 'multi-method' approach.

Limitations of RCTs 

RCTs bolster confidence in causal claims about the effects of a treatment by eliminating threats to internal validity. They do this by using the tools of random assignment and experimental control. However, a medical science that relies primarily upon achieving internal validity with a relative neglect of external validity (as, we argue, many RCTs do) is in great danger of ignoring the individual and contextual characteristics that impinge upon treatment outcome. The five criticisms of RCTs reviewed below point us toward a more diverse medical science.

1. RCTs usually lack external validity. For any given study, clinicians should reasonably ask: to whom do these results apply? Particularly in mental health, it has been shown that RCTs tend to employ such strict inclusion and exclusion criteria that the participants are not representative of the general population of individuals with a given disorder. As Concato and colleagues pointed out [10] (p. 1891): "...an observational study would usually include patients with coexisting illnesses and a wide spectrum of disease severity." Yet the most typical characteristic that excludes people from RCTs is precisely that: the presence of a co-existing disorder [11,12]. Deisboeck [13] observed (p. 2) that "the status quo strategy in medical practice is as simple as it now appears to be intrinsically flawed: carefully assess a patient's symptoms to diagnose his specific disease patterns only to then treat it with a protocol that is based on the assumption that most interpersonal characteristics are rather inconsequential for treatment outcome."

The external validity (or ecological validity) and generalizability of RCTs have been questioned for a number of years [14], and a recent report provided an elegant test of the issue in the context of psychiatry. The Sequenced Treatment Alternatives to Relieve Depression (STAR*D) study is a well-known, multi-centre prospective trial of medications for major depressive disorder [12]. Adults with major depressive disorder and no history of bipolar disorder, psychosis, eating disorder, or obsessive-compulsive disorder were invited to participate in a study that began with an evaluation of citalopram. The investigators analyzed data collected on almost 3,000 STAR*D participants to address whether Phase III RCTs are studying patients who are representative of depressed outpatients in the population. Using standard clinical trial criteria, they separated participants into those who met the inclusion criteria for a Phase III RCT (N = 635) and those who did not (N = 2,220). In other words, only 22% of the STAR*D participants would have passed screening for a traditional RCT. So, in answer to the question raised above ('to whom do these results apply?'), the information obtainable from a traditional RCT with this sample would not have been directly relevant for the remaining 78% of these outpatients with major depressive disorder. The STAR*D trial, with its generally more inclusive eligibility criteria, is itself an example of how external validity can be improved in RCTs.
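The STAR*D percentages quoted above follow directly from the two subgroup sizes. The Python sketch below simply recomputes them; it assumes, as the text implies, that the 635 eligible and 2,220 ineligible participants together make up the whole analysed sample.

```python
# Subgroup sizes as quoted in the text for the STAR*D analysis.
# Assumption: these two groups together form the entire analysed sample.
eligible = 635      # met typical Phase III RCT inclusion criteria
ineligible = 2220   # did not meet those criteria

total = eligible + ineligible
share_eligible = eligible / total
share_ineligible = ineligible / total

print(f"Participants analysed: {total}")
print(f"Eligible for a traditional Phase III RCT: {share_eligible:.1%}")
print(f"Not eligible: {share_ineligible:.1%}")
```

The total of 2,855 is consistent with the "almost 3,000" participants mentioned in the text.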

2. In the long term, RCTs may actually increase health risks in the general population. As a corollary to the issue of external validity, excessively constrained sampling approaches can have consequences for population health. Health researchers sometimes distinguish between efficacy and effectiveness. One might say that all RCTs are efficacy studies (though the terminology becomes confusing, as they are submitted to FDA as evidence of effectiveness as spelled out in section 355(d)), demonstrating benefit under ideal conditions, particularly by selecting only relatively 'pure' samples of people with no co-existing problems. The lack of external validity caused by these atypical samples means that drugs are approved without evidence of effectiveness (defined as benefit in the broader population, under less ideal circumstances). Since people with complex health problems (e.g., hypertension or heart disease) are not usually participants in RCTs evaluating mental health treatments, it is reasonable to consider the possibility that the patients most prone to adverse effects are not studied. Approval of the drug may then increase the risk that those vulnerable individuals will experience exacerbation of their pre-existing health problems. One could argue that effectiveness studies using designs with more external validity, such as case-control methodology, would be more informative about the health impacts of a new drug for the broader population. Without such forms of 'observational' studies, drug approval based only on RCTs may be increasing the health risks to the populace.

3. The premise that RCTs are the only form of evidence capable of providing an unbiased estimate of treatment effects is false. A fundamental reason for the elevated status of RCTs is the conviction held by many that all other forms of evidence, even cohort and case-control studies, overestimate treatment effects. However, some published research does not support this premise. Concato and colleagues [10] evaluated 99 reports on five distinct clinical topics and could find no meaningful differences between the treatment effects obtained from RCTs and those obtained from observational data across a broad array of clinical outcomes. As they pointed out, the literature on psychological, educational, and behavioral treatments has revealed similar findings: no difference in the magnitude of effects reported from RCTs vs. observational studies [15]. So even though randomization, blinding, and placebo controls contribute to a high degree of internal validity, requiring all evidence to fit this model appears to be unjustified.

4. RCTs are unable to tell clinicians everything they really want to know. Comparisons between groups of individuals can obscure processes operating within individuals, yet RCTs reveal only differences between treatment and control group means. These aggregated results are uninformative about the potential benefit of a treatment for any individual in the study and, more importantly, for the individual who was not in the study but whose treatment decisions will nevertheless be made on the basis of it [1]. What clinicians really want to know is whether the person sitting before them is likely to benefit. The averaged results derived from RCTs can offer insufficient or even incorrect guidance on how to approach a specific case [13]. There is no doubt that RCTs provide high-quality information about treatments that should be considered, especially where stratified groups have been included in the analysis. Nevertheless, additional forms of evidence that explicitly include individual and contextual characteristics are needed to help clinicians choose a course of action for specific patients. Single-case experiments, epidemiological data, qualitative data, and field reports from clinicians using an intervention are examples of such additional sources of information.
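The point that a group mean can hide very different individual responses can be illustrated with a small, entirely hypothetical example. The scores below are invented for illustration and are not taken from any study cited here; the sketch only shows that a positive mean difference between arms can coexist with half of the treated participants improving less than the average control participant.

```python
from statistics import mean

# Hypothetical symptom-improvement scores (higher = more improvement).
# Illustrative values only; NOT data from any study cited in the text.
treatment = [12, 11, 13, 0, -1, 1]   # three strong responders, three non-responders
control = [4, 3, 5, 4, 3, 5]

# What an RCT typically reports: the difference between group means.
diff = mean(treatment) - mean(control)
print(f"Difference in group means: {diff:.1f} points")

# What the mean hides: how many treated individuals improved more than
# the average control participant (a descriptive count, not a causal estimate).
control_mean = mean(control)
above_control_mean = sum(1 for score in treatment if score > control_mean)
print(f"Treated participants above the control mean: "
      f"{above_control_mean} of {len(treatment)}")
```

Here the treatment arm "wins" on average, yet the average alone says nothing about which half of the treated group drove that result.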

5. The excessive expense of RCTs leads to vulnerabilities in the quality of evidence. On average, a Phase III clinical drug trial costs more than US$15 million [16], which raises several important issues.

• Part of the cost of an RCT in some countries is significant payment to participants, which has led to a practice of 'guinea-pigging' [17-19] in which some people volunteer for research to gain money or free physical exams. In mental health research it is easy for people to fake symptoms to gain access to these rewards, as there are no biomarkers of mental disorders. One of many problems with this practice is that there is no oversight body to ensure that people have not recently participated in a previous RCT; hence, there is no assured washout period between exposures to different drugs.

• The expense of RCTs adds to the cost of medications that are eventually approved. Hence, the broader population has the right to ask whether it should be paying for research that has questionable generalizability to the general populace.

• The significant costs of RCTs add pressure to bury negative reports that would prevent a medication from reaching the market, thereby eliminating the possibility for the pharmaceutical company to recover its development costs. The total cost of developing a new drug is now in excess of US$1.3 billion [16], a substantial investment to lose if the promise of a new drug is not realized. Turner and colleagues highlighted this concern in their analysis of publication bias in the antidepressant literature [20]. Of 74 clinical trials of antidepressants, 37 of the 38 positive studies were published. But of the 36 negative studies, 33 were either not published or published in a form that conveyed a positive outcome. Such evidence of a strong and consistent commercial distortion in the medical database used to support evidence-based medicine is very worrisome. Without access to reports that fail to demonstrate treatment efficacy, the weight of evidence becomes biased in favor of the treatment. Clinical trial registration systems are, of course, beginning to address this problem.

• The expense of RCTs biases their implementation toward areas for which commercial funding is possible, i.e., pharmaceutical interventions. Million-dollar grants are less available for the evaluation of non-conventional forms of treatment (natural health products, acupuncture, psychotherapy, etc.). When the valuation of RCTs as the "gold standard" is combined with a systematic bias toward commercial applications, the weight of evidence itself becomes biased in favor of pharmaceutical treatments.

• A final adverse side effect of the high costs of RCTs, at least with regard to psychiatric drugs, is the brief intervention period typically employed to evaluate efficacy. Most RCTs in mental health now last only 6-8 weeks, which is more economical than the 12-week or longer trials that were more typical a decade ago (see, for example, [21]). But clinicians have to make decisions about long-term treatment for their patients, usually based on this very short-term information. An additional weakness imposed by this trend is that safety issues resulting from long-term use are often unknown.
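The publication figures attributed to Turner and colleagues can be restated as proportions. This minimal Python sketch uses only the counts given in the text; because the text does not separate unpublished negative trials from those published with a positive spin, it treats the 33 as a single group.

```python
# Counts as quoted in the text from Turner and colleagues [20].
positive_total = 38
positive_published = 37
negative_total = 36
negative_hidden_or_spun = 33   # not published, or published as if positive

total_trials = positive_total + negative_total
negative_reported_as_negative = negative_total - negative_hidden_or_spun

print(f"Trials analysed: {total_trials}")
print(f"Positive trials published: {positive_published / positive_total:.1%}")
print(f"Negative trials published as negative: "
      f"{negative_reported_as_negative / negative_total:.1%}")
print(f"Trials visible in the literature as negative: "
      f"{negative_reported_as_negative} of {total_trials}")
```

A reader relying only on the published record would therefore see almost none of the negative results, which is the bias in the weight of evidence described above.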

The unfortunate result of these financial pressures is that decisions to administer medications to patients for many years may rest on very brief trials conducted with highly atypical participants (people with no other health problems, for instance), and without evidence about other interventions that may be effective.

In conclusion, dismissing or devaluing rigorously collected data obtained with study designs other than RCTs results in a science that is inherently unsupportable, as shaky as a one-legged stool. Indeed, as Concato and colleagues argued, the evidence does not support the commonly accepted concept of a hierarchy of study designs employed for clinical research [10]. If RCTs are not the only, or even the most important, evidence for evaluating effectiveness, then we need to ask what other criteria we can use to support a rigorous evidence-based medicine. This question brings us full circle, back to Sir Austin Bradford Hill.

The Nine Hill Criteria for Defining Causality 

Hill was primarily concerned with causation of disease when he outlined nine considerations for determining causal relations in epidemiology, using the association between smoking and lung cancer as his illustration. However, the criteria he defined in his 1965 presidential address to the Section of Occupational Medicine of the Royal Society of Medicine [22] are also employed to evaluate causal explanations in other contexts [23]. Below, we address the criteria with special reference to mental health issues in order to illustrate their potential utility for evaluating treatments in an area of clinical science where medical and non-medical approaches are often combined.

1. Strength. The strength of association between an outcome and a putative causative agent is an important signal of a causal relationship. All things being equal, a strong association is less likely than a weak one to be produced by extraneous factors alone. Hill did caution, however, that a slight association should not be dismissed as non-causal on that ground alone, and he used as an illustration the fact that relatively few persons harboring meningococcus actually become ill with meningococcal meningitis. As discussed above (item 4 of the limitations of RCTs), the fact that a causal relation between two variables can be detected within persons but not observed when data are aggregated should caution against methods that rely exclusively upon indices of group differences. Therefore, methodology that relies solely on mean differences is a limited approach, and alternatives that accommodate individual differences may bolster conclusions. Such alternatives might include qualitative methods [1] or within-subject crossover designs that can demonstrate on-off control of symptoms in subgroups of the sample.

Defining the strength of association between treatments and symptom severity has been an especially contentious issue in psychiatry, particularly with respect to depression. There has been much concern that publication bias against negative findings, discussed above, results in approval of medications in spite of multiple trials failing to show benefit over placebo. Using Freedom of Information legislation to gain access to unpublished studies of antidepressant efficacy, Kirsch and colleagues showed very little difference between medication and placebo in 35 RCTs of four new-generation antidepressants [24]. They used symptom change on the Hamilton Rating Scale for Depression (HRSD) as the outcome measure and confirmed previous findings: there was an improvement of 9.6 points for the medication group and 7.8 points for placebo controls. Although this 1.8-point difference is statistically significant, it is below 3.0, the threshold for clinical significance used by the National Institute for Clinical Excellence (NICE).
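The comparison with the NICE threshold is simple arithmetic, sketched below in Python using the mean improvements quoted in the text. The placebo-to-drug ratio is an extra, illustrative calculation from the same two numbers, not a figure taken from Kirsch and colleagues.

```python
# Mean HRSD improvements as quoted in the text for Kirsch and colleagues [24].
drug_improvement = 9.6
placebo_improvement = 7.8
NICE_THRESHOLD = 3.0  # drug-placebo difference (HRSD points) for clinical significance

# Round to avoid floating-point noise (9.6 - 7.8 is not exactly 1.8 in binary).
difference = round(drug_improvement - placebo_improvement, 2)
meets_threshold = difference >= NICE_THRESHOLD

# Illustrative extra: how much of the drug-group improvement the placebo group matched.
placebo_share = placebo_improvement / drug_improvement

print(f"Drug-placebo difference: {difference} HRSD points")
print(f"Meets NICE clinical-significance threshold: {meets_threshold}")
print(f"Placebo improvement as a share of drug improvement: {placebo_share:.0%}")
```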

2. Consistency. Hill used this term to refer to obtaining similar results across different research sites and methodologies, something we might now call replication. As he pointed out, repeating studies is necessary to show that the obtained association is not a result of confounding variables in one setting or group. Similar results from independent researchers using different methods are more convincing than a single study. In the social sciences, this is generally referred to as method triangulation, or multi-methods [9].

3. Specificity. Hill recognized that two different patients may have varied outcomes from treatment simply because of individual variables. Accordingly, it is not always possible to demonstrate specificity, even when a causal association exists. For example, Hill noted that smokers have a higher death rate than non-smokers for many causes of death. However, the relative increase for other diseases is modest (10-20%), while the increase for lung cancer is 900-1000%. Such specificity in the magnitude of the association provides important evidence for a causal association. In contrast, there are multiple determinants of mental disorders (psychosocial, biological, societal), making it very difficult to estimate the specific contribution of any particular predictor variable. Because of this multicausality, this criterion may not always be applicable to evaluating causality in areas such as mental health, where family dynamics and other social issues can play such a prominent role. Qualitative methods may also play a role in generating hypotheses in such situations.

4. Temporality. Temporality refers to the common-sense notion that the cause always precedes the outcome. Temporality is crucial for determining the direction of causality: e.g., does a decrease in a factor result in a disease, or does the disease result in a decrease of that factor? In mental health, elucidating this relationship can be difficult, but not impossible. For instance, within-subject crossover designs (e.g., ABAB, where A is the active treatment and B is a period of placebo, or at least removal of the active treatment) can be useful for investigating the effect of a treatment. Assuming minimal carryover effects and sufficient time for washout, this methodology can show (a) whether there is an improvement in the patient's condition, and (b) whether the treatment, rather than a confounding variable, was actually the factor causing the improved outcome. We note that when the treatment sequence can be randomly allocated, such crossover designs can be considered RCTs.
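The logic of an ABAB design can be sketched as a descriptive check: is symptom severity lower in each treatment (A) phase than in the neighboring withdrawal (B) phases? The scores below are hypothetical, and the function name `shows_on_off_control` is our own. This is a visual-analysis style summary that assumes negligible carryover and adequate washout, not a formal statistical test.

```python
from statistics import mean

# Hypothetical weekly symptom-severity scores (lower = better) for ONE patient
# in an ABAB design. Illustrative values only; not data from any cited study.
phases = [
    ("A1", [14, 12, 11]),   # active treatment
    ("B1", [18, 19, 20]),   # placebo / treatment withdrawn
    ("A2", [13, 11, 10]),   # active treatment reintroduced
    ("B2", [19, 20, 21]),   # treatment withdrawn again
]

phase_means = {name: mean(scores) for name, scores in phases}
labels = [name for name, _ in phases]


def shows_on_off_control(means, order):
    """Return True if every A phase has a lower mean severity than each adjacent B phase."""
    for i, label in enumerate(order):
        if not label.startswith("A"):
            continue
        # Collect the B phases immediately before and after this A phase.
        neighbors = [order[j] for j in (i - 1, i + 1)
                     if 0 <= j < len(order) and order[j].startswith("B")]
        if any(means[label] >= means[n] for n in neighbors):
            return False
    return True


for name in labels:
    print(f"{name}: mean severity {phase_means[name]:.1f}")
print("Consistent on-off pattern:", shows_on_off_control(phase_means, labels))
```

Because symptoms improve each time the treatment is introduced and worsen each time it is withdrawn, the sequence supports the temporal ordering of cause and effect within a single patient.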
5. Biological Gradient. A biological gradient is best described as a dose-response curve: more treatment would presumably result in a correspondingly greater effect. Hill realized that this criterion might not be applicable to all research fields, and he recommended that it be considered only where logical. Within psychiatry, outcomes vary largely because of individual differences. For example, it is clear that the optimum dose of a given medication is not necessarily the highest one tested. Even in the Kirsch meta-analysis described above, the relationship between baseline severity and response to medication was not linear [24]. Where the biological gradient for a disorder is complex (as in mental health), this criterion is not necessarily applicable for establishing causality.

6. Plausibility. Plausibility refers to whether the cause and effect can be reasonably connected given the current state of knowledge within the discipline. Importantly, Hill states that new findings must not be immediately dismissed if they do not fit with current knowledge ("dogma") on the subject. As Kuhn observed, considerable evidence disconfirming the accepted view must accumulate before new ways of thinking can emerge from new data [25]. Furthermore, a new treatment could alleviate symptoms of a disorder yet be disregarded because its mechanism of action is unknown.

7. Coherence. Similar to plausibility, coherence refers to the agreement of a study's findings with what is already known. According to Hill, the cause-effect interpretation of the data should not seriously conflict with current knowledge of the disease. Furthermore, Hill mentioned that basic laboratory evidence should not be a requirement, primarily because some outcomes would be difficult to demonstrate in a controlled environment. For instance, the search for animal models of human mental disorders has a long history, but no matter how promising some of the achievements, science can never validate an animal model for internal mental functions such as delusions, suicidality, euphoria, or hallucinations. For this reason, the criterion of coherence may not be a requisite standard for evaluating a novel approach to something like mental disorders.

8. Experiment. The criterion of experimental evidence can be fulfilled in many ways. One reason for the wide acceptance of RCTs for pharmaceuticals is the well-documented placebo/expectancy effect in psychiatry. As Kirsch et al. commented regarding their meta-analysis, the greater benefit of medication observed among the more severely depressed participants in those 35 RCTs seemed to be a result of a decreasing placebo response, which was otherwise quite high (> 80% of the drug response) [24]. Placebo-controlled randomized trials provide one type of design to control for placebo effects. Other alternatives include within-subject crossover designs and case-control studies.



9. Analogy. The notion that a similar cause results in a similar outcome is referred to as analogy. Hill described this criterion as justifying acceptance of slighter evidence in light of previous results, citing the case of pregnancy and thalidomide. If a new drug showed signs of adverse consequences during pregnancy, we would be less hesitant to stop its use, even on limited evidence of harm, because of the tremendous social and personal cost that resulted from thalidomide. This criterion may not be relevant to all areas of health care treatment, particularly mental health. As with the criteria of specificity and biological gradient, the issue comes down to the multicausality of mental disorders and the multifinality of patient outcomes. Two patients presenting with similar illnesses sometimes respond quite differently to an identical treatment. Likewise, the fact that a past patient responded positively to a drug does not ensure that the current patient will, which introduces uncertainty about the nature of the association between the treatment and the outcome.

Hill emphasized that not all nine criteria would be applicable to all situations. In mental health, for instance, five appear to be applicable for the demonstration of causality: Strength, Consistency, Temporality, Plausibility, and Experiment. For any application, common sense needs to prevail when deciding which criteria to use in evaluating causality.

Discussion 

One could argue that over-reliance on RCTs has fostered a less critical form of thinking in the evaluation of health care treatments. Several years ago Smith and Pell wrote a satirical, insightful commentary on the need to do an RCT of the effectiveness of parachutes for the prevention of major trauma caused by gravity [26]. They concluded that people "...who insist that all interventions need to be validated by a randomised controlled trial need to come down to earth with a bump" (p. 1460). We suggest that ignoring data from sources other than RCTs results in a one-legged stool that brings progress in health treatment down with a bump.

The methods we use constrain the types of observations we can make. Because of this, it is important to use as many different sources of information as possible. Multi-method research can provide converging evidence on treatment effects, where "multi-method" refers to obtaining diverse sources of information that are minimally related to the existing sources. Unfortunately, it is increasingly difficult to fund or publish studies that are not RCTs. While the majority of social scientists fly the multi-method banner, it is RCTs that primarily hold the attention of health researchers. This dependence on RCTs means that the weight of evidence is precariously balanced upon a single method, a clear example of the instability of a one-legged stool. Because inferences from clinical research propagate to clinical practice, failure to consider multiple sources of information compromises the foundation on which medical decisions are based, and on which lives may depend.

Summary 

In summary, over-reliance on RCTs (a) has been influenced in part by market pressures relevant to pharmaceutical companies, (b) was stimulated significantly by the 1962 amendments to the American Food, Drug, and Cosmetic Act, and (c) is not scientifically sound. As Parker stated (p. 971) [1], "...it seems imprudent to assume that one type of methodology provides the only path to knowledge." There are alternatives to depending solely on RCTs, especially from the perspective provided by the Hill criteria, which enable us to evaluate treatments in health care more fully.